**Pipelining**

* Pipelining is an architectural technique for improving speed of programs
  + With 5-stage datapath, instructions are executed one at a time with one stage active in any clock cycle
  + Pipelining allow multiple stages for different instructions to be simultaneously active
* Pipeline organization
  + A new instruction enters the pipeline every cycle
  + Information about each instruction is carried through the pipeline as the instructions flow through the stages
    - Stored in interstage buffers – i.e. the interstage registers + more
    - B1 – feeds instruction fetched from fetch stage to decode stage
    - B2 – feeds execute stage; holds:
      * 2 operands from register file
      * Immediate value
      * Incremented PC (as return address)
      * Control signals settings
    - B3 – feeds memory stage; holds:
      * Result from ALU
      * Data to be written to memory
      * Incremented PC (from B2)
    - B4 – feeds writeback stage; holds:
      * Value to be written to reg file
        + Can be result from ALU, result from memory stage, or incremented PC
* Hazard – any condition that requires the stalling of pipeline
  + E.g. consider the instructions:

ADD R2, R3, #100

SUB R9, R2, #30

* + - R2 is written to in stage 5 of I1 (cycle 5)
    - R2 is read from in stage 2 of I2 (cycle 3)
  + There is a data dependency between these instructions – a data hazard is caused
    - In particular, this is a read-after-write (RAW) hazard
  + Pipeline must be stalled so that R2 is read in cycle 6
    - Control circuitry compares source operands of I1 (stored in B2) and dest operand of I2 (stored in B1) when I2 is decoded in cycle 3
    - Recognizes data dependency, stalls I2 by inserting NOPs (no-operation – one clock cycle of idle time; a.k.a. a bubble)
    - I2 is held in B1 while I1 is carried through the pipeline

Clock 1 2 3 4 5 6 7 8

I1 F D C M W

I2 F D (stall) C M W

* Operand forwarding – dependencies can be handled w/o stalling the pipeline
  + Although R2 is written to in cycle 5, the result of I1 is available at the end of cycle 3
  + This value can be passed to C stage (ALU) of I2 in cycle 4
  + Introduce MUX A before operand A input of ALU
    - Can select from RA or forwarded value from RZ

Clock 1 2 3 4 5 6

I1 F D C M W

I2 F D C M W

* Memory stalls – caused by delays in memory responses
  + i.e. if cache miss, accessing main memory ⇒ subsequent instructions are stalled

Clock 1 2 3 4 5 6 7

I1 F D C M W

I2 F D C M …

I3 F D C …

* Load-use stalls – RAW dependency even if cache hit + operand forwarding
  + E.g. consider:

LDR R2, (R3) ; result available in M stage (cycle 4)

SUB R9, R2, #30 ; R2 used in C stage (cycle 4)

* + I2 requires 1 cycle of stall

Clock 1 2 3 4 5 6 7

I1 F D C M W

I2 F D C M W

* Unconditional branch delays
  + Branch instructions alter sequential execution – can’t always just fetch I2 while I1 is being decoded
  + Consider an unconditional branch instruction I1 with target Ix
  + Target Ix is known for I1 in C stage (cycle 3), but I2 & I3 are fetched in cycles 2 & 3
  + Instructions fetched down the wrong path need to be squashed – results in branch penalty
    - Here I2 & I3 are squashed – branch penalty = 2 cycles
  + Can introduce an adder in D stage for computing branch address
    - Reduce branch penalty down to 1 cycle

Clock 1 2 3 4 5 6

I1 F D C M W

I2 F (squashed)

Ix F D C …

* Conditional branch delays
  + Introduce comparator in D stage for conditional branches
  + Branch penalty = 1
* Delayed branching
  + Branch delay slot – location immediately following branch
  + Instruction in delay slot is always fetched
  + Instead of conditionally discarding delay slot instruction, always let it complete execution
  + Find an instruction before the branch and reorder so that it’s in the delay slot (if data dependencies permit)
  + The instruction is useful & is always executed regardless of branch decision – branch penalty = 0
  + If not possible, inserting NOP in delay slot (always delay one cycle)
  + Ex:

ADD R7, R8, R9

B\_if\_[R3]=0 TARGET

I2 …

* + - I2 will be squashed; branch penalty = 1
    - No dependency exists between the ADD and B; reorder ADD into delay slot
  + Improved:

B\_if\_[R3]=0 TARGET

ADD R7, R8, R9

* + - I2 …
    - ADD is always performed, no instructions squashed; branch penalty = 0
* Static branch prediction
  + Assume branch is not taken
  + Branches at end of loops are usually taken, i.e. branch offset < 0
  + Branches at start of loops are usually not taken, i.e. branch offset > 0
  + Otherwise, assume randomly (accuracy ~50%)
* Dynamic branch prediction